In Sections 12.1 - 12.3 we saw how supervised and unsupervised learners alike can be extended to perform nonlinear learning via the use of arbitrary linear combinations of nonlinear functions / feature transformations. However the general problem of engineering an appropriate nonlinearity for a given dataset has thus far remained elusive. In this Section we introduce the first of two major tools for handling this task: universal approximators. Universal approximators are families of simple nonlinear feature transformations whose members can be combined to create arbitrarily complex nonlinearities, like any we would ever expect to find in a supervised or unsupervised learning dataset. Here we also introduce the three standard types of universal approximators employed in practice today - kernels, neural networks, and trees - about which we will have much more to say in future Chapters.
# imports from custom library
import sys
sys.path.append('../../')
import autograd.numpy as np
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'
%matplotlib notebook
# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
from matplotlib import rcParams
rcParams['figure.autolayout'] = True
%load_ext autoreload
%autoreload 2
One can think of a universal approximator the way we think about the periodic table of atomic elements and physical substances - with any substance reducible to a combination of its members. By themselves the members of a universal approximator are simple, but when combined appropriately they can create any nonlinear feature transformation desired (a property also called universal approximation).
What we will see here is that any nonlinearity $h\left(\mathbf{x}\right)$ - whether it be an appropriate nonlinear model for a supervised learning dataset, or a nonlinear encoder / decoder for unsupervised data - can be approximated by a linear combination of $B$ universal approximators $f_1,\,f_2,\,\ldots,\,f_B$ as
\begin{equation} h\left(\mathbf{x}\right) \approx w_0 + w_1\,f_1\left(\mathbf{x}\right) + w_2\,f_2\left(\mathbf{x}\right) + \cdots + w_B\,f_B\left(\mathbf{x}\right) \end{equation}
where each of the nonlinear functions / universal approximators on the right hand side can have internal parameters.
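To make this form concrete, the short sketch below builds such a linear combination directly. The three unit functions and the weight values here are hypothetical placeholders chosen only for illustration - they are not members of any particular universal approximator.
import numpy as np

# a hypothetical catalog of B = 3 simple nonlinear units f_1, f_2, f_3
units = [lambda x: x, lambda x: x**2, lambda x: np.sin(x)]

def model(x, w):
    # w[0] is the bias; w[b + 1] weights the b-th unit in the linear combination
    return w[0] + sum(w[b + 1] * f(x) for b, f in enumerate(units))

# evaluate the combination at a few inputs using arbitrary example weights
w = np.array([0.5, 1.0, -0.25, 2.0])
print(model(np.array([0.0, 0.5, 1.0]), w))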
In a perfect world our desired approximation $\approx$ would attain strict equality $=$, i.e., $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = y_p$. In such a perfect instance the dataset lies precisely on a line (or more generally, a hyperplane). The most perfect dataset we could possibly have for linear regression is then - ignoring computation - a line (or hyperplane) itself (as illustrated below in the animated gif).
In complete analogy to the linear case, here our perfect dataset would consist of points where $\text{model}\left(\mathbf{x}_p, \mathbf{w}\right) = y_p$, i.e., points lying perfectly on the nonlinear curve (or in general the nonlinear surface) defined by $f$. The most perfect dataset we could possibly have for nonlinear regression is then - ignoring computation - a curve (or surface) itself (as illustrated below in the animated gif).
Examining our setup, here for linear two-classification our perfect dataset would consist of points where $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = y_p$, i.e., points lying perfectly on the step function with linear boundary given by $\text{model}\left(\mathbf{x},\mathbf{w}\right) = 0$. The most perfect dataset we could possibly have for linear classification is then - ignoring computation - a perfect step function (with linear boundary) itself. This is illustrated below in the animated gif.
Examining our setup, here for nonlinear two-classification our perfect dataset would consist of points where the nonlinear decision boundary given by $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = 0$ perfectly separates the data. In other words, that our points lie perfectly on the step function with this nonlinear boundary. The most perfect dataset we could possibly have for nonlinear classification is then - ignoring computation - a perfect step function (with nonlinear boundary) itself. This is illustrated below in the animated gif.
Before jumping into a discussion of organized families of nonlinear functions, here we illustrate how adding together arbitrarily chosen nonlinear functions can fail to provide the sort of fine-grained nonlinear modeling we are after. In particular we will explore fitting combinations of a few simple nonlinear functions to the noisy sinusoidal dataset shown in the next Python cell.
# create instance of linear regression demo, used below and in the next examples
csvname = datapath + 'noisy_sin_sample.csv'
demo1 = nonlib.demos_part_2.Visualizer(csvname)
demo1.show_pts()
While we saw in the previous Sections that we could sometimes determine an appropriate nonlinearity directly, here suppose this is the sort of dataset whose nonlinearity we will have to find by searching over combinations of nonlinear features.
First, suppose we try out various combinations of the following three functions
\begin{equation} f_1(x) = x ~~~~~~~~~~~~~~ f_2(x) = x^2 ~~~~~~~~~~~~~~ f_3(x) = \text{sinc}(10x + 1) \end{equation}which we plot in the next Python cell.
# plot our features
demo1.plot_feats(version = 1)
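For reference, a minimal stand-alone sketch of these three features (independent of the plotting utility above) might look as follows; note that numpy's np.sinc is the normalized sinc, sin(pi*x)/(pi*x), so its argument scaling may differ slightly from the sinc drawn above.
import numpy as np
import matplotlib.pyplot as plt

# the three candidate features; np.sinc is the normalized sinc sin(pi*x)/(pi*x)
f_1 = lambda x: x
f_2 = lambda x: x**2
f_3 = lambda x: np.sinc(10*x + 1)

x = np.linspace(-1, 1, 200)
for f, name in zip([f_1, f_2, f_3], ['$f_1$', '$f_2$', '$f_3$']):
    plt.plot(x, f(x), label=name)
plt.legend()
plt.show()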
We will try combinations of these three features in succession. That is, we will first use just the first feature, with a predict function that is simply the weighted sum of $f_1$
\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\,f_1(x) \end{equation}
Remember here the notation $\omega$ denotes the entire set of weights used (here just $w_0$ and $w_1$). We fit this model to the given dataset by minimizing the corresponding Least Squares cost.
Next we will try a weighted combination of the first two elements $f_1$ and $f_2$
\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\,f_1(x) + w_2\,f_2(x) \end{equation}tuning these weights properly by fitting to the dataset.
Finally we try a weighted combination of all three features as
\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\,f_1(x) + w_2\,f_2(x) + w_3\,f_3(x) \end{equation}tuning the weights by fitting to the dataset.
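Since each of these models is linear in its weights, each Least Squares fit can be computed in closed form. Below is a rough sketch of this fitting procedure using numpy's least squares solver and a synthetic stand-in for the noisy sinusoid data; it is only an illustration of the idea, not the library's own fitting routine.
import numpy as np

# a small synthetic stand-in for the noisy sinusoid data (the csv above could be loaded instead)
x = np.linspace(0, 1, 30)
y = np.sin(2*np.pi*x) + 0.1*np.random.randn(30)

feats = [lambda x: x, lambda x: x**2, lambda x: np.sinc(10*x + 1)]

for B in [1, 2, 3]:
    # build the feature matrix [1, f_1(x), ..., f_B(x)] and solve the Least Squares problem for w
    F = np.column_stack([np.ones_like(x)] + [f(x) for f in feats[:B]])
    w = np.linalg.lstsq(F, y, rcond=None)[0]
    cost = np.mean((F.dot(w) - y)**2)
    print('model with B = %d features: Least Squares cost = %.4f' % (B, cost))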
With the weights of all three models tuned, let's plot the results. In the left, middle, and right panels below we plot the dataset and the resulting fit from the first, second, and third models respectively.
# plots showing the first (left), second (middle), and third (right) model predictions
demo1.show_fits(version = 1)
Here none of the predictions are adequate. With the first two we have failed to fully capture the nonlinear nature of the data, and with the third we have greatly overestimated this nonlinearity.
Moreover, while the nonlinearity of the first two predictions changes fairly gradually, the leap in nonlinearity from the second to the third prediction is substantial. This is due to the fact that while the first two features - monomials of two successive degrees - are closely related, the third feature is drastically different from the first two. With such drastically different features we are unable to dial in the sort of nonlinearity we want.
If we swap out the third sinc feature for the next monomial term $f_3(x) = x^3$ - giving the three features
\begin{equation} f_1(x) = x ~~~~~~~~~~~~~~ f_2(x) = x^2 ~~~~~~~~~~~~~~ f_3(x) = x^3 \end{equation}we not only gain significantly more fine-grained control over how much nonlinearity we introduce into our prediction, but can also produce a much better result.
We plot all three of these features in the next Python cell.
# plot our features
demo1.plot_feats(version = 2)
And now - just as before - we use the first of these features, then a linear combination of the first two, and then finally a combination of all three to make predictions. In the next cell we plot the first, second, and third predictions in the left, middle, and right panels respectively.
# show first round of fits
demo1.show_fits(version = 2)
By using three related features - as opposed to having one odd-ball nonlinearity - we get a fantastic fit using all three features this time. Moreover the change in nonlinearity is more gradual, predictable, and controllable across the three predictions.
This example highlights an important point: we need to be organized and thoughtful in our search for the right combination of features. Simply taking random combinations of various nonlinear functions as a predictor, e.g.,
\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\text{tanh}\left(w_2 + w_3x\right) + w_4e^{w_5x} + w_6x^{10} + \cdots \end{equation}does not generally allow for fine-tuning of the nonlinearity in a prediction (nor efficient computation, as we will see). Conversely, if we restrict ourselves to a set of related functions we can better manage the amount of nonlinearity in our models.
The catalog of kernel functions consists of groups of functions with no internal parameters, a primary example being polynomials. When dealing with just one input this sub-family of functions looks like
\begin{equation} f_1(x) = x, ~~ f_2(x) = x^2,~~ f_3(x)=x^3,... \end{equation}and so forth with the $D^{th}$ element taking the form $f_D(x) = x^D$. A combination of the first $D$ units from this sub-family is often referred to as a degree $D$ polynomial. There are an infinite number of these functions - one for each positive whole number - and they are naturally ordered by their degree (the higher the degree, the further down the list that function is). Moreover each element has a fixed shape - e.g., the monomial $f_2(x) = x^2$ is always a parabola, and looks like a cup facing upwards. In the next Python cell we plot the first four elements (also called units) of the polynomial sub-family of kernel functions.
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()
# plot the first 4 elements of the polynomial basis
demo.show_1d_poly()
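A stand-alone sketch of how such a degree $D$ polynomial feature transformation can be built for one-dimensional input is shown below (independent of the plotting utility above); the function name here is an illustrative choice.
import numpy as np

def poly_features_1d(x, D):
    # columns are the units f_1(x) = x, f_2(x) = x^2, ..., f_D(x) = x^D
    return np.column_stack([x**d for d in range(1, D + 1)])

x = np.linspace(-1, 1, 5)
print(poly_features_1d(x, D=4))   # five input points, four polynomial units each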
In two inputs $x_1$ and $x_2$ polynomial units take the analogous form
\begin{equation} f_1(x_1,x_2) = x_1, ~~ f_2(x_1,x_2) = x_2^2, ~~ f(x_1,x_2) = x_1x_2^3, ~~ f(x_1,x_2) = x_1^4x_2^6,... \end{equation}with a general degree $D$ unit taking the form
\begin{equation} f_m(x_1,x_2) = x_1^px_2^q \end{equation}where $p$ and $q$ are nonnegative integers and $p + q \leq D$. A degree $D$ polynomial in this case is a linear combination of all such units. This sort of definition generalizes to polynomial units in general $N$ dimensional input as well.
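A minimal sketch of how these two-input units can be enumerated for a given degree $D$ is shown below; the function name and unit ordering here are illustrative choices, not the library's.
import numpy as np

def poly_units_2d(x1, x2, D):
    # all units x1^p * x2^q with nonnegative p, q and 1 <= p + q <= D
    # (the constant unit p = q = 0 is treated separately as the bias)
    feats = []
    for total in range(1, D + 1):
        for p in range(total + 1):
            q = total - p
            feats.append(x1**p * x2**q)
    return np.column_stack(feats)

x1 = np.random.randn(10)
x2 = np.random.randn(10)
print(poly_units_2d(x1, x2, D=3).shape)   # (10, 9): a degree 3 polynomial in two inputs has 9 non-constant units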
In the Python cell below we draw a sampling of these polynomial units.
# draw a few polynomial units in two inputs
demo.show_2d_poly()
In this example we animate the fitting of linear combinations of $B$ polynomial units, which generally take the form
\begin{equation} \text{predict}\left(\mathbf{x}, \omega \right) = w_0 + {w}_{1}\,f_1\left(\mathbf{x}\right) + {w}_{2}\,f_2\left(\mathbf{x}\right) + \cdots + w_B\,f_B\left(\mathbf{x}\right), \end{equation}to a variety of simple datasets. Notice in each case how - because these features come from the same sub-family of related nonlinear functions - the corresponding predictions change fairly gradually as additional units are added to the model (in other words, we can now fine-tune the amount of nonlinearity we want in our prediction).
First, in the next cell we animate the final trained predictions of models using $B = 1$ through $B = 4$ polynomial units, fit to the noisy sinusoid dataset shown in the previous Subsection. As the slider is moved from left to right one additional polynomial unit is added to the model, the linear combination of units is fit to the dataset, and the resulting fit is shown on the data in the left panel. In the right panel we simultaneously show the Least Squares cost function value of each trained model.
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='poly',num_elements = [v for v in range(1,5)])
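A rough way to reproduce the spirit of this animation outside the library is to fit polynomials of increasing degree by ordinary Least Squares and track the cost, e.g. using numpy's polyfit on a synthetic stand-in for the dataset above.
import numpy as np

# synthetic stand-in for the noisy sinusoid csv used above
x = np.linspace(0, 1, 40)
y = np.sin(2*np.pi*x) + 0.1*np.random.randn(40)

for B in range(1, 5):
    # a degree B polynomial model; np.polyfit minimizes the same Least Squares cost
    coeffs = np.polyfit(x, y, deg=B)
    cost = np.mean((np.polyval(coeffs, x) - y)**2)
    print('B = %d polynomial units: Least Squares cost = %.4f' % (B, cost))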
Next, in the following Python cell we animate the final trained predictions of models using $B = 1$ through $B = 10$ polynomial units, fit to a three-dimensional noisy sinusoid dataset. As the slider is moved from left to right one additional polynomial unit is added to the model, the linear combination of units is fit to the dataset, and the resulting fit is shown on the data. We do not plot the corresponding Least Squares cost function value, but as with the above example it decreases as we add more polynomial units to the model.
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_units = [v for v in range(1,10)] ,view = [20,110],basis = 'poly')
And finally, an analogous example with two-class classification. As you move the slider from left to right additional polynomial units are added to the model. In this case each notch of the slider adds multiple units, since we are increasing the degree $D$ of the full polynomial (which, for two inputs, means adding several individual polynomial units each time $D$ is increased).
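Under the unit convention described above, each increase of the degree $D$ adds $D + 1$ new units (all $x_1^px_2^q$ with $p + q = D$), so a degree $D$ polynomial of two inputs contains $\frac{D(D+3)}{2}$ non-constant units in total, as the quick count below confirms.
# number of non-constant polynomial units in two inputs up to and including degree D
for D in range(1, 5):
    num_units = D * (D + 3) // 2   # 2 + 3 + ... + (D + 1) units accumulated by degree D
    print('degree D = %d  ->  %d units' % (D, num_units))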
# load in dataset http://math.arizona.edu/~dsl/
csvname = datapath + '2_eggs.csv'
demo = nonlib.classification_basis_comparison_3d.Visualizer(csvname)
# run animator
demo.brows_single_fits(num_units = [v for v in range(0,4)], basis = 'poly',view = [30,-80])
3D CLASSIFICATION FIT GOES HERE
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()
# plot elements of a single hidden layer network basis with tanh activation
demo.show_1d_net(num_layers = 1,activation = 'tanh')
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()
# plot elements of a single hidden layer network basis with relu activation
demo.show_1d_net(num_layers = 1,activation = 'relu')
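The two cells above visualize elements of a single hidden layer network basis with tanh and relu activations. Unlike polynomial units these units have internal parameters; in the standard single layer form (a sketch of the usual convention, not necessarily the library's exact parameterization) a unit looks like $f(x) = \text{tanh}\left(v_0 + v_1x\right)$, with the relu variant replacing tanh by $\text{max}(0,\cdot)$. The snippet below evaluates a few such units with randomly chosen internal parameters.
import numpy as np

def tanh_unit(x, v):
    # a single hidden layer unit with internal parameters v = (v0, v1)
    return np.tanh(v[0] + v[1]*x)

def relu_unit(x, v):
    return np.maximum(0, v[0] + v[1]*x)

x = np.linspace(-3, 3, 7)
for _ in range(3):
    v = np.random.randn(2)   # different internal parameters give units of different shapes
    print(np.round(tanh_unit(x, v), 2), np.round(relu_unit(x, v), 2))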
# animate fits using B = 1 through B = 4 single layer tanh units on the noisy sinusoid dataset
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='tanh',num_elements = [v for v in range(1,5)])
# animate fits using an increasing number of network units on the 3d noisy sinusoid dataset
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_elements = [v for v in range(1,12)] ,view = [20,110],basis = 'net')
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()
# plot elements of a depth 1 tree (stump) basis
demo.show_1d_tree(depth = 1)
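The cell above plots elements of a depth one tree basis, i.e., stumps. A stump is a step function with internal parameters: a split point and the two constant levels taken on either side of it. A minimal sketch of such a unit is given below (the parameter names here are illustrative, not the library's).
import numpy as np

def stump(x, split, left_level, right_level):
    # a depth 1 tree unit: one constant value to the left of the split point, another to the right
    return np.where(x < split, left_level, right_level)

x = np.linspace(0, 1, 6)
print(stump(x, split=0.5, left_level=-1.0, right_level=2.0))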
# browse individual stump fits on the raised noisy sinusoid dataset
demo = nonlib.stump_visualizer_2d.Visualizer()
csvname = datapath + 'noisy_sin_raised.csv'
demo.load_data(csvname)
demo.browse_stumps()
# animate fits using an increasing number of tree (stump) units on the noisy sinusoid dataset
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='tree',num_elements = [v for v in range(1,10)])
# animate fits using an increasing number of tree units on the 3d noisy sinusoid dataset
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_elements = [v for v in range(1,20)] ,view = [20,110],basis = 'tree')
# compare fits using an increasing number of units on the (noise-free) sinusoid function dataset
demo = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'sin_function.csv'
demo.load_data(csvname)
demo.brows_fits(num_elements = [v for v in range(1,50,1)])
# the same comparison using the noisy sinusoid sample
demo = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_fits(num_elements = [v for v in range(1,25)])
# fit up to B = 24 polynomial units to the noisy sinusoid dataset
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='poly',num_elements = [v for v in range(1,25)])
# choose the number of polynomial units via 3-fold cross-validation
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_cross_val(basis='poly',num_elements = [v for v in range(1,10)],folds = 3)
# choose the number of tanh units via 3-fold cross-validation
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_cross_val(basis='tanh',num_elements = [v for v in range(1,10)],folds = 3)
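The two preceding cells use 3-fold cross-validation to choose the number of polynomial and tanh units respectively. A rough sketch of the underlying idea is given below, using plain Least Squares polynomial fits and synthetic data in place of the library routine.
import numpy as np

# synthetic stand-in for the noisy sinusoid csv used above
x = np.linspace(0, 1, 60)
y = np.sin(2*np.pi*x) + 0.2*np.random.randn(60)

folds = 3
splits = np.array_split(np.random.permutation(len(x)), folds)

for B in range(1, 10):
    errors = []
    for k in range(folds):
        val = splits[k]
        train = np.concatenate([splits[j] for j in range(folds) if j != k])
        # train on the remaining folds, measure Least Squares error on the held-out fold
        coeffs = np.polyfit(x[train], y[train], deg=B)
        errors.append(np.mean((np.polyval(coeffs, x[val]) - y[val])**2))
    print('B = %d units: average validation cost = %.4f' % (B, np.mean(errors)))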
